Softmax Regression

In previous sections, we explored linear regression and its implementations, both from scratch and using high-level APIs. Regression models are typically used for quantitative outputs such as predicting prices, number of wins, or the number of days a patient might stay in the hospital. However, not all problems are best served by regression models, because of the nature of their outputs. This leads to special cases such as logistic regression or survival modeling.

Classification Problems

Classification shifts the focus from "how much?" to "which category?" questions. Examples include determining whether an email is spam, predicting customer behavior, or identifying objects in images. Classification can be categorized into:

  • Hard Classification: Direct assignment to a category.
  • Soft Classification: Probability of category membership.
  • Multi-label Classification: Items can concurrently belong to multiple categories.

Examples of Classification Questions:

  • Does this email belong in the spam folder or the inbox?
  • Is this customer likely to subscribe to a service?
  • What type of animal is depicted in this image?
  • Which movie will someone likely watch next?

Image Classification Problem Setup

Consider a simple example where each input is a 2x2 grayscale image, represented by four features. Each image is classified into one of three categories: cat, chicken, or dog.

Label Representation

  • Natural Encoding: Using integers such as $y \in \{1, 2, 3\}$, where each number represents a different category.
  • One-hot Encoding: Each category is represented by a binary vector: $y \in \{(1, 0, 0),\ (0, 1, 0),\ (0, 0, 1)\}$
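As a quick illustration, the conversion between the two encodings can be done with PyTorch's built-in `torch.nn.functional.one_hot` (a convenience utility; the from-scratch implementation later in this section does not need it). The example labels here are arbitrary:

```python
import torch
import torch.nn.functional as F

# Integer labels for three examples, e.g. chicken (1), cat (0), dog (2)
y = torch.tensor([1, 0, 2])

# One-hot encode each label into a length-3 binary vector
y_onehot = F.one_hot(y, num_classes=3)
print(y_onehot)
# tensor([[0, 1, 0],
#         [1, 0, 0],
#         [0, 0, 1]])
```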

Linear Model for Classification

To handle multiple categories, we use multiple affine functions—one per category. For example, with four features and three categories, we require a total of 12 weights and three biases. The model outputs are given by:

\begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\ o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\ o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3. \end{aligned}

This setup is equivalent to a single-layer fully connected neural network.
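Collecting the weights into a $3 \times 4$ matrix $\mathbf{W}$ and the biases into a vector $\mathbf{b}$, the three affine functions collapse into one matrix-vector product, $\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$. A minimal sketch with random values (the numbers carry no meaning, only the shapes do):

```python
import torch

torch.manual_seed(0)
x = torch.rand(4)       # four pixel features of one 2x2 image
W = torch.rand(3, 4)    # one row of weights per category (12 weights total)
b = torch.rand(3)       # one bias per category

o = W @ x + b           # three outputs ("logits"), one per category
print(o.shape)          # torch.Size([3])
```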

The Softmax Operation

The softmax function converts the linear outputs to probabilities by applying the exponential function and normalizing these values so that they sum to one. It is defined as:

\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \textrm{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}.

This function ensures that the output values are non-negative and sum to 1, which are necessary properties for probabilities.
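A small numerical check of these properties, using PyTorch's built-in `torch.softmax` (the from-scratch version appears later in this section). Note that production implementations typically subtract $\max_j o_j$ before exponentiating to avoid overflow, which leaves the result unchanged:

```python
import torch

o = torch.tensor([1.0, -2.0, 3.0])   # arbitrary logits
y_hat = torch.softmax(o, dim=0)

print(y_hat)                          # every entry lies in (0, 1)
print(y_hat.sum())                    # the entries sum to 1
```

The largest logit ($o_3 = 3$) receives the largest probability, but the smaller logits still get nonzero mass, which is what makes this a "soft" assignment.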

Vectorization for Efficiency

For computational efficiency, especially when processing minibatches of data, we use vectorized operations. This involves matrix-matrix multiplications which are computationally faster and more suitable for modern computing architectures.
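Concretely, stacking a minibatch of $n$ examples row-wise into a matrix $\mathbf{X}$ turns $n$ separate matrix-vector products into a single matrix-matrix product $\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}$. A sketch with illustrative shapes (256 examples, 4 features, 3 categories):

```python
import torch

batch_size, num_inputs, num_outputs = 256, 4, 3
X = torch.rand(batch_size, num_inputs)     # minibatch of flattened images
W = torch.rand(num_inputs, num_outputs)
b = torch.rand(num_outputs)

# One matrix-matrix product replaces 256 separate matrix-vector products;
# broadcasting adds the bias vector b to every row of the result
O = torch.matmul(X, W) + b
Y_hat = torch.softmax(O, dim=1)            # softmax applied row-wise
print(Y_hat.shape)                         # torch.Size([256, 3])
```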

Loss Function for Softmax Regression

We utilize the cross-entropy loss, a common choice for classification tasks. This loss measures the difference between the predicted probabilities and the actual distribution represented by the one-hot encoded labels:

l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j.

This setup corresponds to maximizing the likelihood of the observed labels given the predictions, providing a probabilistic grounding to learning the model parameters.
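Because $\mathbf{y}$ is one-hot, all terms of the sum vanish except the one for the true class, so the loss reduces to $-\log \hat{y}_{\text{true}}$. A small worked example with made-up predicted probabilities:

```python
import torch

y_hat = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])   # predicted probabilities
y = torch.tensor([0, 1])                  # true class indices

# The one-hot sum keeps only each example's true-class log-probability;
# averaging over the minibatch gives the loss
loss = -torch.log(y_hat[torch.arange(len(y)), y]).mean()
print(loss)   # -(log 0.7 + log 0.8) / 2 ≈ 0.2899
```

Perfect confidence in the correct class ($\hat{y}_{\text{true}} = 1$) would give a loss of zero; probability mass placed elsewhere drives the loss up.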

Softmax Regression Implementation from Scratch Using PyTorch

import torch
from d2l import torch as d2l

class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        # Initialize weights from a Gaussian and biases at zero
        self.W = torch.normal(0, sigma, size=(num_inputs, num_outputs),
                              requires_grad=True)
        self.b = torch.zeros(num_outputs, requires_grad=True)

    def parameters(self):
        return [self.W, self.b]

    def forward(self, X):
        # Flatten each image into a vector before the affine transformation
        X = X.reshape((-1, self.W.shape[0]))
        return softmax(torch.matmul(X, self.W) + self.b)

    def loss(self, y_hat, y):
        return cross_entropy(y_hat, y)

def softmax(X):
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdims=True)
    return X_exp / partition  # Broadcasting divides each row by its sum

def cross_entropy(y_hat, y):
    # Pick out the predicted probability of each true class and average
    return -torch.log(y_hat[list(range(len(y_hat))), y]).mean()

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)

# Prediction on the validation set: the most likely class per example
X, y = next(iter(data.val_dataloader()))
preds = model(X).argmax(axis=1)
preds.shape

# Identify and visualize incorrect predictions
wrong = preds.type(y.dtype) != y
X, y, preds = X[wrong], y[wrong], preds[wrong]
labels = [a + '\n' + b for a, b in zip(
    data.text_labels(y), data.text_labels(preds))]
data.visualize([X, y], labels=labels)